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Abstract 

Atmospheric trace-gas inversion refers to any technique used to predict spatial and temporal 
fluxes using mole-fraction measurements and atmospheric simulations obtained from computer 
models. Studies to date are most often of a data-assimilation flavour, which implicitly consider 
univariate statistical models with the flux as the variate of interest. This univariate approach 
typically assumes that the flux field is either a spatially correlated Gaussian process or a spatially 
uncorrelated non-Gaussian process with prior expectation fixed using flux inventories (e.g., 
NAEI or EDGAR in Europe). Here, we extend this approach in three ways. Eirst, we develop 
a bivariate model for the mole-fraction field and the flux field. The bivariate approach allows 
optimal prediction of both the flux field and the mole-fraction field, and it leads to significant 
computational savings over the univariate approach. Second, we employ a lognormal spatial 
process for the flux field that captures both the lognormal characteristics of the flux field (when 
appropriate) and its spatial dependence. Third, we propose a new, geostatistical approach to 
incorporate the flux inventories in our updates, such that the posterior spatial distribution of 
the flux field is predominantly data-driven. The approach is illustrated on a case study of 
methane (CH4) emissions in the United Kingdom and Ireland. 
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Hamiltonian Monte Garlo, Spatial statistics 


‘Corresponding author 

Email address: azmOuow. edu. an (Andrew Zammit-Mangion) 



1. Introduction 


Atmospheric trace-gas inversion refers to any technique used to predict spatial and tem¬ 
poral fluxes of a gas from observations of mole fractions. Since mole-fraction measurements 
are affected by weather patterns that are time varying, there is no straightforward relationship 
between the observations and the fluxes. The relationship, termed a ‘source-receptor relation¬ 
ship’ (SRR), is typically obtained by simulating from a computer model, such as a Lagrangian 
Particle Dispersion Model (LPDM), which maps the fluxes to the mole fractions [1]. The SRR 
can be characterised through a bivariate function 6t(s, u) where the spatial locations s,u G 
are in a given domain of interest, D C and tG{l,2,...}isa discrete-time index. 

Figure top-left panel, shows the UK and Ireland methane (CH 4 ) total fluxes by grid cell, 
in units g s“^, obtained from a combination of flux inventories. Here, the component inventories 
are principally the National Atmospheric Emissions Inventory (NAEI) [2] and the Emissions 
Database for Global Atmospheric Research version 4.2 (EDGAR) [3]; see [1] for details. Eigure[^ 
top-right panel, shows an evaluation of bt{s, u), in units s ng“^, of methane fluxes, for s located at 
a mole-fraction monitoring station at Angus, Scotland (TTA), where u takes values on a discrete 
lattice, and where t = 1 (01 January 2014 at midnight). Eigure[^ bottom panel, shows the two- 
hourly averaged observations in parts per billion (ppb) at TTA, following background-removal 


(see Section 4.2 for details), for all of January 2014. In atmospheric trace-gas inversion, the aim 
is to recover a flux map, such as that in Eigure[^ top-left panel, from time-series observations 
taken at various monitoring stations and from the collection of atmospheric simulations (one 
such is shown in Eigure[^ top-right panel) that establish the SRR. 

Atmospheric trace-gas inversion is an ill-posed problem. Gonsequently, “small uncertainties 
in the observational data correspond to much higher uncertainty in the emission[s]” [5]. However, 
in addition to observation and flux-field uncertainties, there is a third source of uncertainty 
arising from the use of an atmospheric model that does not perfectly match the true SRR due 
to physical parameterisations, solver discretisations, etc. Sometimes this third term is called 
the discrepancy term, and failure to acknowledge it can lead to over-confident predictions PE!- 
Until very recently, this discrepancy was not considered separately from the observation error; 
see H]. However, it is crucial to distinguish between the errors due to observations and those 
due to model misspecification, as these are likely to have different statistical properties. 

A critical contribution of our work is to formalise the insight in [3] and treat the mole-fraction 
field as a second variable of interest. The consideration of modelling a mole-fraction field in 
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Figure 1: Top-left panel: Total flux by grid cell in g obtained by combining methane inventories (see 
ID for details) for January 2014. Top-right panel: The source-receptor relationship on 01 January 2014 at 
00:00, bi{sTTA, •), obtained from the UK Met Office’s Lagrangian Particle Dispersion Model (LPDM), Numerical 
Atmospheric-dispersion Modelling Environment (NAME), where stta is the coordinate vector of the Angus 
measurement station (TTA), in Scotland. Also shown are the three other stations used in this study, Mace Head 
(MHD), Ridge Hill (RGL), and Tacolneston (TAG). Bottom panel: Measurements of methane mole fraction in 
parts per billion (ppb) at TAG for January 2014, following background-removal (black dots), together with a 
straightforward prediction (red line) using NAME and the methane flux inventory. Each time step corresponds 
to an interval of 2 h. 


addition to the flux field through a discrepancy term results in a bivariate model (see [8] for a 
recent review on such models). The bivariate model brings two advantages to this problem of 
trace-gas inversion: The locations at which the mole-fraction field is modelled need not coincide 
with the data points. This in turn allows the predictive distribution of mole fraction at any 
unobserved locations to be found, and then averaged over any subset of the spatio-temporal 
domain, with relative ease (provided the SRR is available for these locations/domain). The 
other advantage is that the decoupling leads to computationally efficient methods in spatial 
statistics that can be used to scale up the inversion to large, remote-sensing datasets. 
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Another contribution of our work is to introduce the lognormal spatial process as a prior 
distribution for the flux field. This model acts as a bridge between the two sides of the di¬ 
chotomous literature that either assumes spatial (possibly truncated) Gaussian-process priors 
(e.g., M) or spatially uncorrelated lognormal priors (e.g., |10jl. A lognormal spatial process is 
attractive, as it is able to capture both (i) the nonnegativity and heavy tails in the distribution 
of the flux (valid for some trace gases such as methane) and (ii) the spatial correlation of the flux 
field. We show that expectations and covariances for both the flux field and the mole-fraction 
field can be obtained analytically if the flux field is defined as a lognormal process, by directly 
applying results from the univariate case (mo p. 135]. 

The third contribution of our work is to propose a new way to carry out assimilation in 
atmospheric trace-gas inversion. Frequently, the prior expectation of the flux process is set 
from one or more inventories that are many times unreliable and that may have unquantifiable 
effects on a posterior assessment. Here, we only use characteristics of the inventories, namely 
the spatial length scales and the marginal variance, while setting the prior expectation to be 
spatially constant. In this way, the inventory fluxes are not used directly in the assimilation, and 
the spatial distribution of fluxes obtained from the posterior expectation will be predominantly 
data-driven. Our contribution addresses a concern in m Section 5.2] that suggests that such 
an approach is difficult when prior distributions are non Gaussian. 

The article is structured as follows. In Section]^ we discuss the three contributions outlined 
above. In Section]^ we detail our approach to inferring the fields of interest using a combination 
of approximate inferential methods. The proposed framework is then assessed in Section]^ first 
in a study in one-dimensional space with simulated datasets, and then for emissions prediction 
in the UK and Ireland using the four measurement stations illustrated in Figure top-right 
panel. Section contains conclusions and an outline of future research directions. 

2. Theory 

The notation we use corresponds to that commonly found in the spatial-statistics literature 
(e.g., [Mj). Here, stochastic processes are denoted using regular typeface, while bold typeface is 
used to denote vectors. The symbol Z denotes observations while Y denotes processes (fields). 
The notation Y(-) is shorthand for ‘the process Y over the given spatial domain’ while Y{s) 
is the process Y evaluated at a specific location s. The subscripts m and / are used for fields 
or domains associated with the mole-fraction field and the flux field, respectively, while the 
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subscript t denotes a discrete-time index. Therefore Yf{-) denotes the flux field, while 
denotes the spatio-temporal mole-fraction field at time-index t. Spatial domains are notated 
using the letter D] the superscript O is used to denote observation domains and the superscript 
L is used to denote process (discretised onto a lattice) domains. The cardinality of a set is 
denoted using | • |, while {A\B\ denotes the conditional probability distribution of A given B. 

For simplicity, as in mm, we will assume throughout that the flux is a spatial process 
over a short time period; the development of models for spatio-temporal flux processes will be 
considered elsewhere. Recall that we use the subscript / to denote ‘flux,’ however because in 
this paper we constrain the fluxes to be positive a priori, we use the words ‘flux’ and ‘emissions’ 
interchangeably. 

2.1. Bivariate modelling: Treating mole fraction as a second variate 

In this section we discuss the univariate- and bivariate-modelling approaches to atmospheric 
trace-gas inversion, highlighting the key differences between the two. This discussion is used to 
motivate the advantages of the bivariate model, namely (i) interpretability, (ii) the possibility 
to compute the predictive distribution of mole fraction anywhere in, or averaged over any subset 
of, the domain of interest, and (iii) computational benefits arising from the decoupling. 

Denote the mole-fraction observation at location s and time t as Zm,t(s), and define 
as the set of measurement locations at time t. Further, denote the collection of all observation 
locations as where T = {!,...,T} is the temporal domain and T > 1 is 

the number of time steps. There are typically very few or no flux observations; flux prediction 
usually proceeds by using the following univariate model (or its discretised equivalent) for 
inversion p [El uni El [IT]: 

= [ bt{s,u)Yf{u)du +em,tis); f E T, (1) 

Jd 

where D is the spatial domain, em,t{^) is (Gaussian) observation error, bt{s, u) is the interaction 
function, or SRR, that returns quantities with dimensions [time] [mass]and Yf{-) is the 
flux field that returns quantities with dimensions [mass] [time][area]The field T/(-) is a 
stochastic process that, as we discuss below in Section [2^ has statistical properties that reflect 
those of the underlying process (e.g., positivity). Note that {lj(u) : u E D} is the flux density, 
and it is different from the flux by grid cell {74(u)lj(u) : u E D^} (where ^(u) is the grid 
cell area at u and is some discrete subset of D) that returns quantities with dimensions 
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[mass] [time] Note also that there is no mention of the complete mole-fraction field in 0, 
only observations of the mole-fraction field at specific locations. 

In our bivariate model, the object of interest shifts from the observations of mole fraction 
to the mole-fraction field itself, which we model as 

y-mA^) = f bt{s,u)Yf{u)du +s e teT, (2) 

Jd 

where Ym,t{-) is the mole-fraction field and is a user-defined set of discrete locations in 
D at which Ym,t is evaluated. The additive component, Ct(s), captures the spatio-temporal 
discrepancy (due to mole-fraction background) and variation in the true mole-fraction field 
that cannot be explained by the flux held through the SRR. Note that it is different from 
in 05 which represents all the unexplained signal, since it explicitly accounts for model 
misspecihcation. Finally, the observations available for the inversion are not the held itself, but 
a noisy, incomplete version of it: 

— Ym,t{A Y ^ ^ ( 3 ) 

where £m,t{') is a measurement-error process. 

There are three important differences between 0 and First, 0 is dehned on while 
0 is only dehned for ^ C D] that is, the locations at which mole fraction is inferred need not 
(but could) coincide with the observation locations. Second, due to the decoupling, the locations 
s G at which the SRR is evaluated are not dependent on the observation locations; the case of 
\D^\ <C \D^\ occurs for regional studies that use computationally-intensive LPDMs to simulate 
the SRR. Third, the mole-fraction observations do not appear in 0. This is computationally 
benehcial as it is usually reasonable to assume that observations are conditionally independent 
given the mole-fraction held, but not necessarily given only the hux held. This conditional 
independence implies that the conditional covariance matrix of {Zm,t(s)} given the mole-fraction 
held is diagonal, resulting in simplihed matrix computations. 

Consider a discretisation of D for the hux held, denoted (that need not be identical to 
DA)- Dehne 
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^m,t — ■ S E , 

'^m,t = {Zm,t{^) '■ S E D^-i-)', 

Gm = {^m,t ■ ^ ^ '^) ) 

Zm = (Z'^,, : t E T)', 

= (£m,t(s) • S E D^ -f.) , 

Ym,t = {Ym,t{^) : s E 

= (^m,* ■ i £ ’7~)^ 

= (Y:„,, : t E T)', 

Ct = (Ct(s) : s E Dt)\ 

Yf = (Y;(s) : s E Djy, 

c = (c;: t G r)', 

where the ‘prime’ symbol is used to denote the transpose operation. Then, in the univariate 
model the relationship between the mole-fraction observations and the flux field is approx¬ 
imated by 

'Z‘m,t — ^U,tYf Gm,t'} t Y1 (4) 

where ^u,t = (^(u)6t(s, u) : s E Dyy ^,\i E D^) is a J x \D^\ matrix that approximates 
the integral in Q, and {^(u) : u E are integration weights with dimension, [area]. In 
the bivariate model, the relationship between the mole-fraction field and the flux field, (© , is 
approximated by 

Y^^t^^B,tY f + Ct] teT, ( 5 ) 

where 

BB,t = (A(u)6f (s, u) : s E u E D^) (6) 

is a jZZ^I X \D^\ matrix that approximates the integral in Q. The relationship between the 
mole-fraction observations and the mole-fraction field is 

= CtYjn,t T t E 7~, (7) 

where Cf is a x |Z7^| incidence matrix that indicates where the observation locations 

D^ t are in D^. 

The first advantage of the bivariate approach is the ability to compute the predictive distri¬ 
bution of the mole-fraction field at locations, or over space-time domains, where the field is not 
directly observed (note that the SRR still needs to be available for these locations or domains). 
In the univariate approach, Q, we are able to compute (through a standard application of 


7 


Bayes’ rule), the probability distribution of the flux field given the mole-fraction observations, 
that is, [Y f I Zm]- Following this, we could try to obtain the predictive distribution of the mole 
fractions at time t through f \ Zm], but this does not take into account the discrepancy 

In the bivariate case, with ([^ and 0 we can compute the joint posterior distribution 
[Y f,^m I Zm\ at locations Dm x , from which we can in principle obtain both the posterior 
distribution of the flux field [Y f \ Zm] and that of the mole-fraction field [Y^ | Zm\ ■ We may 
also easily find the posterior distribution of any affine transformation of the random vector Y^, 
and hence it is straightforward to carry out inferences over a spatial or temporal aggregation of 
the mole-fraction field if desired. 

The second advantage of the bivariate approach is computational. In many applications it is 
reasonable to assume that the observation error Em is uncorrelated. However, this assumption 
does not hold for (in the univariate case), since em,t{^) should incorporate spatio-temporal 
correlations arising indirectly from the discrepancy term m- Since the observations Zm are 
(in practice) spatio-temporally irregular, there is no straightforward way, without modifying 
the underlying model (e.g., through covariance tapering [l8]), to define cov{em,t) such that it is 
sparse or such that it can be decomposed into simpler components. On the other hand, Ym,t{-) 
(equivalently Qi')) can be discretised in such a way that covariance matrices associated with 


Ct(s) can be easily stored and inverted. For example, in Section 3.2 we use a regular space-time 
grid for the discretisation, and we exploit the properties of the Kronecker product to simplify 
the computation as in |1]. 


2.2. The lognormal bivariate model and its properties 

To date, most studies have assumed either that Yf{-) is a Gaussian process or that Yf{-) 
is spatially uncorrelated and non-Gaussian. Non-Gaussianity is generally important to model 
since the spatial distribution of fluxes tends to exhibit a heavy right tail and, for some trace 
gases such as methane, it is reasonable to further assume that it has only positive support. 
However, it is easy to see from flux inventories that often the spatially-uncorrelated assumption 
is unrealistic. In this section we present a new model, namely a lognormal spatial process, that 
can cater for both non-Gaussianity and for spatial correlations that appear in the flux field. 
We then embed this in the bivariate model and discuss the model’s resulting first-order and 
second-order properties. 

Assume that Yf{-) is a spatial process such that Yf{-) = InYj(-) is a Gaussian process [H]. 
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Then T/(-) is termed a lognormal spatial process; clearly, P(Yf{s) < 0) = 0, for s € D, so that 
!/(•) is almost surely positive everywhere. Assume further that E(lj(s) | r?) = /U/(s | i?) and 
cov(l/(s), Yy(u) I r?) = Cjf{s,u \ i9), where i? is a vector of unknown parameters. Then from 

m, 


/i/(s I I?) = E(Yy(s) I i9) = exp(p,f(s | ■&) + (l/2)C'^(s, s | i?)); s e D, ( 8 ) 

C'^(s,u I I?) = cov{Yf{s),Yf{u) I ■&) = pf{s I r9)/i/(u | 'i9)[exp(Cjy(s,u | i9)) - 1]; s,u£ D. (9) 


That is, the mean and covariances of the lognormal spatial process are adjusted versions of 
the exponentiated mean and covariances of the Gaussian spatial process, and thus they can be 
easily computed. 

We only need to specify two components in Q and (§: the prior mean and the prior 
covariance functions in the ‘log space,’ that is, Jlf{s \ i?) and Cjf(s,u \ '&); s,u £ D. The way 
in which we do so does not differ substantially from when Gaussian spatial processes are used 
to model the prior and, in this regard, research to date falls into two broad categories. Eirst, in 
works that employ a data-assimilation framework, /U/(s | is typically set from values in a flux 
inventory, whilst Cff{s, u | r9); s, u G D, is chosen through expert judgment (e.g., [2I])- Second, 
in works that employ a geostatistical framework, ^/(s | i9) is defined as a linear combination 
of regressors, whilst Cff{s,u \ •&)■, s,u G T), is a standard univariate covariance function (e.g., 
the stationary and isotropic exponential covariance function [9l[22]). In Section 2.3 we offer an 
enhanced approach to eliciting the first-order and second-order moments of Yy(-). 

Given Q, we are able to obtain closed-form expressions for the expectation and covariance 
function of the mole-fraction field at time t. These are, 

I 'd) = E(ym,t(s) I'd) = [ bt{s, v)nf{v \ i?)dv E(Ct(s)); s£ D, (10) 

Jd 

where we deliberately allow the possibility that E(^i(s)) ^ 0 ; and 


u I 0 , d) — cov(yjyi j(s), Y^ ^i^u) | 0 , d) 

= C(;^,i(s, u I 0) / [ bt{s,v)Cff{v,w \ d)bt{u,w)dvdw; s,ugT>, (11) 
Jd Jd 

where C'^^^f(s,u | 0) = cov(Ct(s), Ci(u) | 0), and 0 is a vector of unknown parameters appear¬ 
ing in the covariance function of the discrepancy term. Again, from ([^, the cross-covariance 
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functions between the mole-fraction field at time t and the flux field are 


I 1 ?) = cov(Yf{s),Ym,t{'ii) {'&)=[ Cff{s,w \ (u, w)dw; s,u€ D, (12) 

JD 

C'm/,t(s,u I 1 ?) = cov(ym,i(s),l/(u) I i9) = / Cj(w,u I i9)6t(s,w)dw; s,\ieD. (13) 

JD 


Equations (10)-(13) are identical to those given in [23] with and Cjj given by Q and 
(§, respectively. Now we can write down the expectation and covariance of the joint model at 
time t as 


Dist 



/ 


CmmA' 1 ' 


(14) 


(■) 


where Yt(-) = Dist(-,-) is a distribution that here is not Gaussian but which 


features the first two moments, and the terms in (14) are given by (10)-(^13E We stress that, 


unlike in a bivariate Gaussian process, here the joint model [Yi(-)] given by (14) is not fully 
specified by the mean and covariance functions. 

We have derived above the expectation and covariance of Y 4 (-), that is, of the bivariate 
process at a single time point t. In practice, we need to find the expectation and covariance of 


Y(.) = (Y/(-),(Y^,t(-):tGr))'. 


(15) 


These can be obtained using the same ideas used to construct (10)-(13) and the mole-fraction 
covariances coY{YmA')^^m,t'{-))]t,t' G T. 


2.3. Inventories for estimating the spatial lognormal characteristics of the flux field 

One would ideally estimate from the data all unknown parameters appearing in the model. 
However, this can be difficult for parameters affecting fields that are not directly observed (in 
our case Y/(-)) arid in ill-posed problems such as that considered here. In other fields of study, 
this problem has been rectified by estimating the parameters from deterministic models. In |24j . 
for example, parameters appearing in a dynamic stochastic model of aerosol optical depth were 
estimated from a chemical-tracer numerical model. In [25|, the varying spatial-length scales of 
precipitation patterns were estimated from a regional climate model. In the present study, we 
propose estimating the spatial-length scale and the marginal variance of the process Y/(-) from 
flux inventories. Importantly, unlike several works in atmospheric trace-gas inversion, we do 
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Figure 2: Left panel: Histogram of total methane flux by grid cell in the UK and Ireland (see Figure top-left 
panel). Right panel: The empirical (open circles) and fitted (solid line) isotropic semivariogram of the methane 
flux as a function of lag distance in degrees (latitude and longitude). 


not fix the prior expectation of the flux process to be equal to values given in an inventory, 
since it is often found that inventories can be biased due to errors in activity data or emissions 
factors. Instead, we set = c, where c G M is a spatially-invariant constant evaluated below. 
We now show the procedure for obtaining c and the parameters in •) from the inventory 
shown in Figure top-left panel. 

The flux field is generally spatio-temporal and not isotropic, but here we only seek to capture 
the dominant spatial dependence using a simple spatial isotropic model; the spatio-temporal dis¬ 
crepancy term Ci(s) accounts for any remaining variability. First, we divide the total flux by grid 
cell in our inventory (shown in Figure[^ top-left panel) by the grid cell area in order to obtain flux 
densities, which we collect into a vector Then, we let = InZj^,;, elementwise. Note 

that Yinv can be considered dimensionless since, for any constant k, = Ink + \n. 

where is the Th element of 'Linv A histogram of the elements in Tjinv is given in Figure 
left panel, while the empirical isotropic semivariogram based on is shown in Figure 
right panel. 

To obtain Cff{s,u \ i?), we first fit a semivariogram, 7 °(/i | i?), to the empirical semivari¬ 
ogram shown in Figure]^ where /i = ||u — s||, and is the vector of parameters appearing in 
the specification of the semivariogram. Parameter estimates i? were obtained using weighted 
least squares with f it. variogram in the package gstat with R Software p6l \27\ . We fitted 
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three semivariogram models: the spherical, the exponential, and the Gaussian semivariograin. 
From these three, the spherical model was the one that gave the best fit in terms of smallest 
sum-of-squared errors and thus is the one we proceeded with in our analysis. For the spherical 
semivariogram, 'd = (cjf, crl, (t\ is the dimensionless variance of a fine-scale variation com¬ 
ponent (nugget), £72 is the dimensionless variance of a spatially correlated component (partial 
sill), i? is a range parameter in degrees (latitude and longitude), and 


l°{h I 1?) = < 


al + al{{2,h)/{2R)-{l/2){h/Rf), 
af + cr|. 


h = 0; 

0 < h < R] 
h > R. 


(16) 


The spherical semivariogram has a hnite sill, af + a'^, and hence there is a stationary covariance 
function m Section 2.3.2], 


Cff{s, u\-d) = al + a^- 7 °(||u - s|| | i9). (17) 

The parameters corresponding to the fitted semivariogram, shown in Figure right panel, 
are = (0.0053,0.80,3.3°)'. The fitted values were used in pT| ) to obtain the covariance 
function, Cff{s,u \ r?), which is used to describe the spatial dependence in the log-flux field. 
We complete the prior specihcation of the flux held by recalling that /!/(•) = c, and setting 
c = Tj„„(s)/|Ily I = 7.35, which is a spatially invariant, dimensionless constant. We 

verihed the adequacy of the htted lognormal process through a leave-one-out cross-validation 
study and analysis of the prediction residuals in the log space. 

3. Methods 

Atmospheric trace-gas-inversion problems are computationally demanding and the bivariate 
model does not fully alleviate this. First, there are four domains to consider, T, D^, and 
Dj, the sizes of which can affect the computational efficiency and hence the strategy we employ. 
Second, even after estimating i? using the inventories, there are still unknown parameters in the 
discrepancy that need to be estimated; we denote those parameters as 6. For small problems, a 
full Markov chain Monte Carlo (MCMC) procedure to sample from the posterior distribution, 
[Y /, Ym, 0 I Zm, 1 ?], may be employed [3] and, when \T\ grows, exact or approximate algorithms 
that process the data one-temporal-interval-at-a-time may be used |28l I29| . 
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Hierarchical models where prior distributions are placed on the unknown parameters 6, are 
known as Bayesian hierarchical models (BHMs). As both \D^\ and \Dj\ increase, MCMC 
schemes may become prohibitive due to the time taken to obtain a single sample from the 
posterior distribution and due to poor mixing resulting from increased correlations between 6 
and (Yj, Y^)b To deal with this, one can implement empirical hierarchical models (EHMs, see 
|14[ pp. 20-21]) in which the parameters 6 are treated as fixed unknowns that are estimated using 
standard maximum-likelihood techniques. MCMC is then used to sample from the empirical 
predictive distribution, [Yf,Ym \ Zm, where the estimate 0 of 0 is substituted into 

the posterior distribution in place of 0. EHMs do not account for ‘parameter uncertainty’ 
in the Bayesian sense, and prediction intervals obtained for [Yf,Ym \ tend to be 

narrower than those obtained using BHMs dniEo]. Nonetheless, EHMs have been shown to 
give reasonable results at a fraction of the computational cost of BHMs [3T]. We discuss how 
adjusted intervals may be obtained for EHMs in Section]^ 

3.1. Estimating the remaining parameters 

Since the flux and mole-fraction fields are not directly observed, the likelihood of 0 is not 
available in closed form. Maximum-likelihood estimation in this context can be treated as a 
‘missing-data’ problem. Missing-data problems appear in all statistical contexts, and they are 
often solved using iterative methods. One of these methods is the expectation-maximisation 
(EM) algorithm, given in the celebrated article of [32]. Since its appearance, the EM algorithm 
has been widely employed for parameter estimation in empirical hierarchical spatio-temporal 
models (e.g., [33]), where the latent process is treated as missing data. 

The EM algorithm for estimating 0 is relatively straightforward to implement. Define the 
complete-data likelihood, Lc(0) = [Yf,Ym,'^m \ Thus, Lc{0) inherits its randomness 

from Yf,Ym, and Z^- The EM algorithm is defined iteratively through: (i) the E-step, in 
which the function 

Q(0| 0«) = E(lnLe(0) | Z™,0«,^), 

is found for some ‘current’ estimate 0^*^ and (ii) the M-step, which computes 

— argmaxQ(0 | 0^*^). (18) 

e 

The algorithm is iterated until convergence (convergence is assessed by monitoring the change 
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in the elements of 0^*^ for successive values of i). For a comprehensive treatment of the EM 
algorithm, see |34) . 

A slight modification to the standard EM algorithm is necessary in this setting: Because Y j 
is lognormal, the E-step cannot be carried out analytically. However, since Cti^) is a Gaussian 
process, it turns out that we only need to compute the conditional expectation and the condi¬ 


tional covariance of (Yf,Ym) \ (see (A.5) in Appendix A.l). We obtain these by 

finding the mode and the curvature of the log-density (thus invoking a Laplace approximation; 


see |35t p. 213]). Formulas required for the E-step are given in Appendix A.l The resulting 
(approximate) algorithm is referred to as a Laplace-EM algorithm in [31j . 


Eor the models we consider, the optimisation (18) can be carried out using gradient descent. 


Since computing the derivatives of Q{6 \ 0^*^) can be quite costly, we suggest halting the 
gradient descent prematurely for the first few M-steps in order to decrease computational cost. 
The resulting ‘generalised EM algorithm’ |34[ p. 84] converges to a stationary point 6 (when 
the E-step is exact), although it requires more iterations until convergence. 

3.2. Computationally efficient models for the discrepancy term 


As discussed in Section 2 . 1 , one reason to employ a bivariate model is the flexibility it gives 
in choosing a model for the discrepancy term. Clearly, one should choose a model for the 
discrepancy that allows it to scale well with both \T\ and \Dl^\. In this article we employ a 
separable space-time structure (e.g., HES]) that is computationally efficient and is a reasonable 
choice in the face of lack of structural information on the discrepancy term see Section]^ 

for further discussion of this choice. 

A separable covariance function is one that can be characterised through 


cov(Ct(s),Ci'(u) I 0) = cr^p^(s,u 1 e)pt{t,t' 1 0), 

where Psi-,- \ 0) is a spatial correlation function and pt{-,- \ 0) is a temporal correlation function. 
In this work, we restrict our attention to the following correlation functions: 

/ 9 s(s, u j d) = exp(—jju — sjj/d); d > 0 , 
pt{t,t'\ a) = jaj < 1 , 


where d and a are range-like parameters that determine the correlation length of the spatial 
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and temporal components, respectively (we refer to d as a correlation e-folding length). The 
derivatives of these correlation functions with respect to d and a can be easily computed for 
the M-step. Recall that, in our model, the unknown parameters 6 are all associated with the 
discrepancy term’s covariance function; therefore, 0 = {a'^,a,dy. 

Separability implies that 

^C(^) = (cov(Ci(s),Ct'(u) \6):s,u£ D^; t,t' G T) 

= cr^R^,t(a) 0 Rf,s(d), 

where 0 is the Kronecker product, R(;^ 4 (a) is a correlation matrix of size \T\ x \T\ and R^^<j(d) 
is a correlation matrix of size \D^\ x \D^\. One can take advantage of the Kronecker product 
to greatly reduce the memory and computational requirements of the E-steps and M-steps. 
Formulas required for the M-step for this model are given in [Appendix A.2 

3.3. Taking advantage of the closed-form gradient: Hamiltonian Monte Carlo 

Recall that with an EHM, we first estimate the parameters using, for example, the EM 
algorithm. Then, after fixing them at the EM-estimate, we carry out inference on the bivariate 
fields using MCMC. This section describes the second, MCMC, stage for sampling from the 
empirical predictive distribution, [Yf,Ym \ Zm,^,!?]. 

The most popular MCMC algorithm in use is a random-walk Metropolis algorithm, where 
one uses a symmetric density to propose new states in the Markov chain. Metropolis algorithms 
do not take into account the ‘shape’ of the density and, as a result, the optimal acceptance rate 
typically results in chains that are highly correlated. We can improve on the Metropolis sampler 
by using the gradient of the posterior density, which in our case can be computed analytically. 
The gradient is used to reduce the probability of proposing samples in the direction where 
the probability density drops off steeply, and to increase the probability in regions where the 
probability density is relatively smooth. Hamiltonian (or hybrid) Monte Carlo (HMC) is an 
MCMC method that takes into account gradient information and can thus be used to propose 
states that might be distant, in probability space, from the current sample and yet have a small 
chance of rejection. In practice, a trajectory for the samples is simulated using Hamiltonian 
dynamics m- 

A Hamiltonian system is described by a position vector and a momentum vector. In HMC, 
the position vector corresponds to the current sample, while the momentum vector is randomly 
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sampled from a pre-specified distribution. Hamilton’s equations are then used to simulate 
the trajectory of the current sample, under the random momentum vector, for a certain time 
interval. The potential-energy surface used is the negative log-density, and thus the trajectory 
of the sample can be thought of as a particle rolling inside a surface characterised by the 
probability density of interest. In practice, Hamilton’s equations are discretised in time using 
a suitable scheme like the leapfrog method [38], with a given discretisation interval A. Then 
the number of time steps is L = (time interval)/A. Choice of A and L are critical for good 
performance of the MCMC scheme. 

In general, A needs to be set such that the narrow, possibly curving valleys in the negative 
log-density can be explored, while L needs to be set such that the time interval, LA, is large 
enough for the proposal to be treated as independent from the current sample. This is no simple 
task, however by transforming the variables to have roughly unit scale, we can explore values 
of L and A for a suitable acceptance rate (~ 60%) such that LA ~ 1 |38j . The quantity A can 
also be randomised in order to minimise the chance of obtaining periodicity in the trajectories; 
see |38j for further implementation details. 

Since is conditionally Gaussian, conditional on Yy,0, and i?, we can marginalise out 
Ym from [Y f, Ym \ Zm, 0, 1 ?] and use HMC to sample {Y^ directly from [Y f \ Zm, For 


each Y^*\ we then sample Ym from [Y, 
scheme is known as a collapsed Gibbs sampler |39j . The log-density and the gradient for the 
collapsed distribution (i.e., [Yj | Zm, d, r?]) can be derived in a very similar way to the Laplace- 


r{i) 


m I y|*\ Zm, 0,1?], which is Gaussian. This sampling 


approximated E-step (Appendix A.l), and thus they are not detailed here 


4. Results 


In this section, we implement the model of Section]^ and the inference techniques discussed 
in Section[^in two settings. First, in Section[4.1|we use simulated datasets on a one-dimensional 


‘toy’ problem. Second, in Section 4.2 we apply the method to inferring methane emissions from 


mole-fraction observations in the UK and Ireland, as described in Section 
4-1- Study with simulated data 

In the simulation study we simulate the flux from some known prior distribution, and the 
interaction function, which describes the SRR, is synthesised from a spatio-temporal model. 


For consistency with the analysis of methane emissions given in Section 4.2, we assume that 
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the mole fractions obtained using the simulated flux held and the SRR are in ppb, that the 
spatial-length scales are in degrees, and that each time step corresponds to a temporal interval 
of 2 h. In order to facilitate the analysis and visualisation of the results, we carry out these 
simulations in one-dimensional space. Specihcally, the domain of interest is D = [—10°,10°], 
and Dj = {—9.9°, —9.7°,..., 9.9°}, so that \Dj\ = 100. The lognormal spatial process that 
dehnes the hux held has ^/(s) = 5 and Cff{s,u \ t9) given by a spherical, isotropic covariance 


function, with parameters i? set equal to those estimated in Section 2.3 


We synthesised an SRR that rehects something that would be obtained from a numerical 
model such as NAME. That is, we use an SRR characterised by a plume originating at u = s 
with a particular orientation that decays slowly with |u — s| (see Figuretop-right panel) and 
varies smoothly in time. We achieved this by simulating a how process through a truncated nor¬ 
malised Gaussian density with spatio-temporally varying scale. Specihcally, we hrst simulated 
a parameter vt{s) from a Gaussian process with separable spatio-temporal covariance structure, 
and then we dehned 

bt{s,u I vt{s)) = exp(- s| < |ut(s)|) J(s,u), (19) 


where 


J{s, u) = < 


I[{u - s) > 0]; Vt{s) > 0, 
I[{u - s) < 0]; vt{s) < 0, 


and /(•) is the indicator function. In (19), the exponential function describes a bell-shaped 
curve centred at u = s, where the indicator function truncates this curve at tt = s ± ut(s). The 
third term, J{s,u), then truncates the bottom half of the function if vt{s) > 0 and the upper 
half otherwise. A plot showing {bt{s,u) : u G t G Tj at hve locations s (that coincide with 
the observation locations, to be described later), where T = {1,2,... ,T},T = 100 steps, 
is shown in Figure top panel. 

The SRR was subsequently used to construct the matrix defined by ([^, with A(u) = 
0.2°. For the purposes of simulation, we set the true-discrepancy parameters to = 50 ppb, 
a = 0.8, and d = 1°, so that 6 = (2500 ppb^, 0.8,1°)'. We also set E(Ct(s)) = 0, thereby 
assuming that any background mole fraction has been correctly removed a priori. Using these 
parameters and assumptions, we then simulated the flux field from (|^ and (© and the mole- 
fraction held from ([^. This simulation was done only once, to give us a simulated dataset where 
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Figure 3: Top panel: The source-receptor relationship 6t(s, u) synthesised at five observation locations s € = 

{—5.3°, —4.5°, —3.9°, —3.7°, —0.1°}. Note that bt{s,u) = 0 for u > 4.3°. Bottom panel: A sample realisation of 
the flux field (solid line), the resulting time-averaged mole-fraction field (dashed line), and the five observation 
locations (arrows). 


the true fields are known. In Figure bottom panel, we show the true lognormal flux field and 
the true time-averaged mole-fraction held with = D^. 

We hrst considered the typical scenario of when we have only a few measurement stations 
where mole-fraction data are collected. For example, in Section [4. 2( we have only four stations 
in the UK and Ireland. In this simulation study in one-dimensional space we considered hve 
stations, shown by the arrows in the bottom panel of Figure whose locations were 
obtained by sampling randomly from . Data were generated at these locations using ([^, 
and we assumed that data for all time steps were available (i.e., ^ t ^T). Recall that 



18 

















T = 100 time steps, which corresponds to about 8 days of 2-hourly observations. We assume 
that we are interested in predicting the mole-fraction field at the observation locations and, for 
illustration, at another location s = 0.3°, and hence we set U {0.3°}. 

The function bt{s,u) is a measure of how sensitive Ym^tis) is to flux at u. It can be prob¬ 
lematic, in ill-posed problems such as this one, to obtain inferences for locations u when the 
sensitivity at all measurement sites ^ is zero (or approximately zero) for t £ T, since the 
influence of the flux field at these locations on the observed mole fractions is zero (or negligi¬ 
ble). Without assuming a spatial process for the flux field, these regions would not even be 
identifiable, and hence they can be justifiably removed from the model. We therefore re-defined 
the flux domain Dj to exclude these regions. Formally, we can define Dt{s) = {u : bt{s, u) > 0} 
and re-define = (—9.9°, —9.7°,..., 9.9°} n ■ 


In order to run the Laplace-EM algorithm, we chose the starting value, = (1000 
ppb^, 0.2,0.2°)', and commenced the gradient descent at Yj = exp(/ij + 0.5a‘j)l (where jlj = 5 
and a'j = Cff{s, s \ '&)) and Ym = Z^. Using 50 gradient descents at each M-step, the Laplace- 
EM algorithm converged in 40 iterations to 6 = (2000 ppb^, 0.71, 0.77°)' (convergence was as¬ 
sessed visually). Recall that the true value of the parameters were 6 = (2500 ppb^, 0.8,1°)'. We 
can expect the accuracy of these estimates to increase with the length of the observation record, 
T. Hence, when using this method in temporal blocks (e.g., mno]), it is recommended to use 
large temporal intervals; for example in the case study in Section [4. 2| we employ a three-month 
block with T ~ 1000 time steps. 

We substituted the EM-estimated 0 for 9 into [Y f, Ym \ "Zim, 9, i?] and sampled 10,000 times 
from the (empirical) posterior distribution of the flux field using the HMC sampler of Section 


3.3 with L = 10 steps and A G [0.066,0.068]; this resulted in an acceptance rate of 57%. Box 


plots showing inferences for the flux field from the HMC sampler (after discarding the first 1,000 
samples for burn-in), together with the true flux field (crosses) and the observation locations 
(arrows on the horizontal axis) are shown in Figure]^ Notice how the uncertainty increases as 
distance away from the nearest observation location increases, as expected, but also notice how 
the true fluxes conform with our posterior inferences. 

The Laplace approximation is used for parameter estimation and not for prediction of the 
fields, although predictions could be provided as by-products of the E-step at convergence of 
the EM algortihm. The reason we do not use the predictions from the Laplace approximation 
is that, while correctly locating the mode in the posterior distribution, these predictions do 
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Figure 4: Samples from the posterior distribution of the lognormal flux field obtained using Hamiltonian Monte 
Carlo (HMC). The boxes denote the interquartile range, the whiskers extend to the last values that are within 
1.5 times the interquartile range from the quartiles, and the dots show the samples that lie beyond the end of 
the whiskers. The crosses denote the true (simulated) fluxes, and the arrows along the horizontal axis denote 
the locations of the measurements. The vertical dashed lines show the spatial locations analysed in Figure 
Since the observed mole fractions are insensitive to flux at {s : s > 4.3°}, these locations were excluded from the 
model. 


not account for skewness, something that is expected from a lognormal process. The difference 
between the prediction from the Laplace approximation and that from the HMC sampler at 
s = 3.9° (far from an observation location) is shown in Figure]^ left panel. We find that this 
difference is less marked close to an observation location, since the posterior distribution is less 
sensitive to the prior distribution in these regions (e.g., at s = —5.3°); see Figure]^ right panel. 

One of the contributions of this article is to introduce a lognormal-spatial model for the flux 
field, for trace gases where fluxes are positive. We show the importance of capturing the spatial 
characteristics of the flux field by comparing our results obtained with spatial range R = 3.3°, 


to results we obtained with a misspecified, non-spatial model where R = 0.0° in (16). To assess 


the ‘goodness-of-prediction’ for both models we use the root-average-squared prediction error: 


SiJ = 


\Dj\ 


1 1/2 


l^/l S 


yf,i-Yu 


( 20 ) 
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Figure 5: Laplace approximation (dashed line) and a histogram of the empirical posterior distribution from the 
MCMC samples for the flux at s = 3.9° (left panel) and s = —5.3° (right panel). 


Table 1: Statistics S 2 J, and 5?^'^ for the simulated case studies using a lognormal uncorrelated spatial 

process {R = 0.0°) and a lognormal correlated spatial process (R = 3.3°) 

R (degrees) Sij (g degree"^) S 2 J 5?;^ (ppb) 

0.0 97 0.74 ’ 32 

3.3 62 0.83 32 


where Yf^i is the ith. element of the vector Y f and Yf^i is the posterior expectation of Yf^i 
assuming a spatial range that may be different from the true range. The lower the value for 
Sij, the better the predictive performance in terms of pointwise prediction. To assess the 
distributional accuracy of the forecast, we use 


^2,/ = 


1 1 (Yf ■ -Yf 


1 2 


\D^\ 




1/2 


( 21 ) 


where , is the posterior variance of Clearly, S 2 J should be close to 1 if uncertainty is 
correctly captured. 


In Table[^ we show the statistics (20) and (21) obtained under uncorrelated (i.e., misspec- 
ified) and correlated (i.e., correctly specified) flux-field assumptions. The statistic Sij for the 
model with the spatially correlated flux was 62 g s“^ degree”^, while for the uncorrelated model 
(with R = 0.0°), Sij was 97 g s“^ degree”^. Assuming an uncorrelated flux field results in 
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Figure 6: Distributions of mole fraction at s = 0.3° and t £ T, following Gibbs sampling. The dark and light 
shadings denote the interquartile and the 5th-95th percentile ranges, respectively. The crosses denote the true 
(simulated) mole fractions. 


a considerable reduction in performance, caused mainly by poor observability of the flux field 
at s > 0.0° (plot not shown). Because of such regions, it is important to capture the spatial 
correlations of the unobserved helds. Also, the statistic S 2 J is closer to 1 when R = 3.3° than 
when R = 0.0°. A value lower than 1 is indicative of under-confidence (an overall posterior 
variance which is too large); in our case, R = 0.0° represents a model misspecification, and we 
see the consequence in the low value of S 2 J (namely, 0.74). 

Another advantage of the proposed approach is the ease with which we can infer the mole- 
fraction field anywhere in the domain. In Figure we show the distributions of the mole 
fractions at every time point at s = 0.3°, which are consistent with the true mole fraction 


values. A statistic similar to (20) for evaluating the mole-fraction ‘goodness-of-prediction’ at 
s = 0.3° is 


qO.S _ 


1 1^1 

iri 


1 1/2 


_ ’wO.3 
i ^ m.t 


( 22 ) 


where is the element of the vector corresponding to the location s = 0.3°, and 
is the posterior expectation of assuming different values of R. Interestingly, unlike with 
S'lj, the statistic (22) was found to be insensitive to R] see Table This reflects a well-known 
consequence of the ill-posed nature of the problem: There are several ‘plausible’ flux helds that 
can reproduce the observed mole fractions. 

Next, we show that the bivariate approach scales well with dataset size by considering the 
case where \D^\ = 1000. The locations of the 1000 observations were obtained by uniform 
sampling within the domain, and the observations were generated using ([^. Since the number 
of time steps is T = 100, we have 100,000 observations in all, far more than can be handled 
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efficiently with the standard univariate approach based on 0. This problem becomes tractable 
by choosing such that | \ <C | \, which is possible only when using a bivariate model. 

Here we set and, since \D^\ = 100, we have 10,000 space-time locations at which to 

infer the mole-fraction held, which is an order of magnitude less than IH^I- Olhe Laplace-EM al¬ 
gorithm converged in 5 iterations to 0 = (2300 ppb^, 0.81, 0.89°)'. As expected, these estimates 
are more accurate due to the increased number of observations (recall that the true parameters 
were 6 = (2500 ppb^, 0.80,1.00°)'). Increased accuracy was also noted in the posterior mode of 
the hux held, which was virtually identical to the true hux held at all locations in . 

Although it seems relatively straightforward to adopt this separable geostatistical model 
with large datasets, since both and are dense, one cannot use it when either |T| or 
become larger than, say, 2000. One way to remedy this limitation is to employ sparse 
inverse covariance matrices or reduced-rank covariance matrices instead; we provide further 
discussion on one such approach in Section 

Reproducible code for this simulation study is available from https: //github. com/andrewzii/ 
atminv. 

4-2. Inference on methane emissions in the UK and Ireland 

In this section we present a case study that analyses real data on mole fraction in the UK and 
Ireland from the four stations shown in Figure top-right panel, and we infer the discretised 
hux and mole-fraction helds. 

4 . 2 . 1 . Data, preprocessing, and the source-receptor relationships 

Methane observations, available in ppb, were made at four sites: Mace Head, Ireland (MHD), 
Ridge Hill, England (RGL), Tacolneston, England (TAG) and Angus, Scotland (TTA). These 
sites are from the UK Department of Energy and Climate Change (DECC) observation network 
m- In this analysis, the period January-March 2014 was used. The SRR was evaluated at the 
observation locations using the UK Met Office’s Numerical Atmospheric-dispersion Modelling 
Environment (NAME) [T], which is a Lagrangian Particle Dispersion Model (LPDM) that was 
run backwards for 30 days. The spatial domain extended from —14° to 31° E and 36° to 66° 

N at a grid resolution of 0.352° Ion by 0.234° lat, thus covering the UK and a large part of 
continental Europe with a 128 x 128 grid. For a complete description of the measurement 
procedure, ancillary information about the measurement sites, and a description of NAME and 
its application to prediction of UK and Ireland emissions, see [1]. 
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Since the LPDM only simulates the effect of emissions from the previous 30 days before 
measurement, a substantial ‘background’ level of methane exists in the observed mole fractions 
that would not be accounted for by the LPDM. This background depends on the direction of the 
origin of the air mass due to, for example, the N-S latitudinal gradient or the vertical gradient. 
We used the following simple procedure to cater for the background. First, using NAME, we 
identified which observations were influenced by southerly transport or were influenced by the 
upper atmosphere due to vertical transport by tracking the directions that particles exited 
the NAME domain. Due to the complexities involved in estimating these backgrounds, these 
observations were removed from the dataset. Second, winds that were westerly in origin and 
passed over the Atlantic ocean were considered to be relatively free of regional emissions. Hence, 
this background, due to air masses originating in the west, can be clearly observed at Mace 
Head in Ireland. We used the 5th percentile of the extant mole fraction observations at Mace 
Head as a crude estimate of the background for all observations. Third, we subtracted this 
background estimate from the extant observations at all sites. In addition to removing data 
for background purposes, observations that could have significant influence from sub-grid scale 
processes (defined as ‘local’ processes), also identified using NAME, were removed. These data 
corresponded to times when there was a significant sensitivity to the nine grid cells surrounding 
the measurement point (i.e., when air was more likely to be stagnant). 

To restrict our attention to land areas in the UK and Ireland, we used the methane flux 
inventory, together with NAME, to subtract contributions from the rest of Europe, including 
the sea territories of the UK and Ireland. Thus, we have final, extant observations of methane 
mole fraction, corrected for background and due to methane emissions only in the UK and 
Ireland land areas. The flux inventory and the NAME outputs were aggregated to a coarser 
resolution by grouping 2x2 grid cells. This is a downsampling factor of four so that finally 
\D^\ = 122 coarse-resolution grid cells over the UK and Ireland; see Figuretop panels. Since 
we were only provided with the interaction function bt{s, ■) evaluated with s corresponding to 
the four measurement station locations, we set = D^. 

4-2.2. Inference 

The final, extant dataset had 3,409 observations for the period January-March 2014. Out 
of these, we held 20% for validation purposes and used the other 80% for model-training pur¬ 
poses. The validation set was chosen by randomly sampling without replacement from all the 
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observations. 


In this application we are interested in computing the total flux by grid cell, in units g 
s“^. This can be carried out directly by adding a grid-area-dependent offset to ^/(s | i?) and 


constructing without the grid-area weights (see Appendix B). The Laplace-EM algorithm 
was conhgured to carry out 10 gradient descents at each M-step, and it converged in 20 iterations 
to the parameter estimates, 6 = (cj^,a, d)' =(690 ppb^, 0.972, 2.27°)'. The standard deviation 
of the discrepancy, ~ 26 ppb, is considerably larger than that associated with fine-scale 
temporal variability during each 2 h window and instrumentation error (~ 8.6 ppb at TAG, for 
example), and thus it is not negligible. Moreover, the magnitudes of d = 2.27° and a = 0.972 
are indicative of considerable spatial and temporal dependence. The parameter a = 0.972 
corresponds to a temporal-correlation e-folding length of —2/ln(d) = 70 h (where the factor 2 
adjusts for the temporal interval used, 2 h). 

The case study presented here is illustrative, as it only considers a crude treatment of the 
background and makes the strong assumption that the fluxes outside the UK and Ireland land 
areas are known. Yet, it is indicative to compare our results to those in [5j, where a similar 
model to ours was used, but without the lognormal spatial assumption. The spatial-correlation 
e-folding length obtained here (d = 2.27°) matches within error that obtained in [Tj (posterior 
median equal to 133 km), but the temporal-correlation e-folding length we obtain is 2-3 times 
larger than the one in [3j . Such a difference is possibly because they considered sea areas within 
the UK and Ireland territory (these were omitted from the model by us), because the result 
in [5j is averaged from several monthly studies over a 2-year period, and because we use a 
lognormal process for the flux field. 

As in the simulation study (Section |4.1[ ), we iterated the HMC sampler 10,000 times in order 
to obtain samples from the empirical predictive distribution, [Y f \ Z^, 0 , i?], with L = 20 steps 
and A G [0.070,0.071] in the HMC. These settings resulted in an acceptance rate of 54%. The 
pointwise posterior median and 95th percentile are shown in the left and right panel, respectively, 
of Figure and they should be compared with the inventory emissions map shown in Figure 
top-left panel. We obtain total estimates for the UK and Ireland (including inventory values 
for the sea territories) of 2.23 ± 0.08 Tg yr“^ and 0.37 ± 0.05 Tg yr“^, respectively. Both these 
estimates are lower than those obtained from the inventory shown in Figure top-left panel 
(2.65 Tg yr“^ and 0.64 Tg yr“^, respectively), and they support the conclusion in [3] that these 
inventory emissions might be too large. 
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Figure 7: Median (left panel) and 95th percentile (right panel) total methane emissions (fluxes) in the UK and 
Ireland by grid cell, obtained using our Laplace-EM/HMC approach. White grid cells denote regions were total 
fluxes were assumed known and used to correct the observation as described in Section [4.2.II 


Interestingly, despite the omission of a spatially varying prior distribution, the posterior 
spatial distribution of emissions (Figure is compatible somewhat with that in the inventory 
(Figure top-left panel). However, there are some key differences. Many large cities, such 
as Edinburgh, Glasgow, and Dublin, were estimated with lower methane emissions. On the 
other hand, we obtained larger emissions for the south of the UK, most notably a median total 
emission of 0.128 Tg yr“^ in the grid cell containing London, which compares to 0.062 Tg yr“^ 
in the inventory. We note that there is no consensus on methane emissions in London, but 
our result supports the observation in m that there is a two- to three-fold difference between 
inventory and measured fluxes in central London. 

In Figure we show the distributions of mole fraction at each station location during the 
month of January, together with both those observations used to train the model (black crosses) 
and those observations left out for validation (red crosses). The first thing to note is that the 
distributions of mole fractions are available at times when observations are missing. Mole- 
fraction uncertainty increases at these times, but clear patterns are also discernible (e.g., at 
RGL, the peak at t =130 is noticeable). It would be straightforward to show the distribution of 
mole fractions at other isolated locations, however this requires the interaction function to be 
evaluated at these locations too. Second, sometimes the posterior density of the mole fraction 
at a space-time coordinate (s,t) might have probability mass on the negative real axis. This 
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does not imply a negative (unphysical) mole fraction, rather an imperfect correction of the 
background, which in this example is on the order of 1900 ppb. Third, the observations used 
for validation lie within the 90% prediction intervals, with the exception of a few outliers. The 
ability to obtain what seem to be realistic predictions for out-of-sample mole fractions increases 
our confidence in inferences for Y f. However, without the availability of validation methane-flux 
data, critical predictive performance measures in trace-gas inversion cannot be obtained. 

5. Discussion 

In this article, we present a bivariate spatio-temporal model for flux and mole-fraction fields 
that allows atmospheric trace-gas inversion. We make use of efficient computational methods, 
and we include spatial correlations in the flux-field description while maintaining lognormal 
properties typical of some trace gases. We give a way to estimate these spatial correlations using 
flux inventories, and we show how to include them within our model. Importantly, uncertainties 
are captured at each layer in the model (flux level, mole-fraction level, observation level), and 
predictive distributions of mole fractions include uncertainty on the SRR discrepancy. 

However, as yet we do not capture the uncertainty in estimation of the parameters in the flux 
and discrepancy models. Consequently, adjustments might need to be made to the prediction 
intervals obtained from [Y/, Y^ | Z^, G,'^] to cater for the uncertainty in {0, i?}. Of the several 
approaches that may be used for adjustment, the parametric bootstrap approach of Laird and 
Louis [121 HU Chapter 3] appears to be readily applicable here. In the context of this work, 
one would forward-simulate an ensemble of mole-fraction observations using the parameter 
estimates {0, "0} and from each of these simulated observations, one would re-estimate the 
parameters using the Laplace-EM algorithm. Then the collection of parameter estimates can 
be used to obtain a collection of predictive densities (using HMC) that can be averaged to give 
an adjusted predictive distribution for {Yj,Ym}. Here we expect 0 to be unidentifiable, so 
we suggest bootstrapping only to adjust for uncertainty in 0. Alternatively, one might obtain 
bootstrap estimates of 0 using only the inventory data (Section |2.3[ ). 

The lognormal distribution was chosen in order to satisfy the positivity constraint typically 
imposed on methane flux. As has been noted elsewhere |13) . it has the undesirable property 
that negative fluxes occur with zero probability; negative methane fluxes are unlikely but can 
occur. This can be remedied by introducing a third, shifting parameter into the lognormal 
distribution such that P(Yf < 0) > 0. The resulting distribution, known as the three-parameter 
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Figure 8: Distributions of mole fraction (with the background mole fraction subtracted) due to UK and Ireland 
land-based emissions at the four measurement stations, where t = 1 corresponds to 01 January 2014 at 00:00, 
t = 2 corresponds to 01 January 2014 at 02:00, and so on until 31 January 2014 at 22:00. The dark and light 
shadings denote the interquartile and the 5th-95th percentile ranges, respectively. The red crosses denote the 
observations used for validation, whilst the black crosses denote those used for training. 


lognormal distribution in the univariate case [441 p. 113], can also be calibrated using existing 


inventories, in a manner similar to Section 2.3 This increases the model’s generality and, of 
course, distributions other than the lognormal may also be considered. 

For computational reasons, we assumed that the discrepancy term has a covariance function 
that is separable in space and time. However, computational limits will also be reached with 
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this approach, for increasing \T\ or \DI^\. The cost of storing an n x n dense matrix is 0{n?), 
and the computational cost of inverting it is O(n^). One approach to remedy this is to define 
sparse inverse covariance matrices for two Gaussian Markov random fields (GMRFs), one for 
space and one for time, over the space-time locations of the mole-fraction field; see [l5] for 
a comprehensive treatment of GMRFs. This sparsity reduces considerably the memory and 
computational requirements in inference. Space-time separable GMRFs have been used in the 
context of air-quality monitoring in [3^. Another approach is through reduced-rank spatio- 
temporal models [3Zl sg SH], and there spatio-temporal basis functions (but not separability) 
is required. Other possibilities are through predictive processes [50] and approximations to 
stochastic partial differential equations m- 
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Appendix A. The Laplace-EM algorithm for estimating discrepancy parameters 

We first describe some of the notation used throughout this appendix. The symbol 0 will be 
used to denote element-wise division whilst the symbol 0 denotes element-wise multiplication. 
These operators can only be used for vectors or matrices of the same size and resnlt in a vector, 
or matrix, of that size. Throughont, we will use numerator layout notation: The derivative of 
a scalar by a (column) vector returns a transposed vector, whilst the derivative of a scalar by 
a matrix X returns a matrix whose (z,j)th element corresponds to the derivative of the scalar 
with respect to the (j, i)th element of X. 

We shall also be repeatedly using the following identities for two vectors X, Y in the E-step: 

diag(Y)X = diag(X)Y, 

V = diag(l 0X), 
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where diag(X) returns a diagonal matrix with X on the diagonal. 

In the M-step of the EM algorithm we shall use the following identities for two matrices 
A,B of size p X p and q x q respectively, and scalar x: 


lA^^i = |Ar, 

^-l - A -1 


(A®B)-' = A- 


B 


-1 


|A®B| = |A|^|B|p, 

^ln|A| _ / ^A 

dx ~ \ dx 

d{A(^B) . 5B aA „ 

^ ^A® — + — ®B, 
ox ox Ox 


where in this context | • | is the determinant and (8> is the Kronecker product. 

Appendix A.l. E-Step 

In the E-step, we compute 


Q(0| 0«) = E(lnLe(0) | Z„,0«,i9), 


(A.l) 

(A.2) 


(A.3) 


that is, the expectation of the complete-data log-likelihood with respect to the conditional 
distribution of {Yj, Y^} given the data Z^, and the current parameter estimate i = 
0,1,.... Hence, in order to be able to evaluate Q{6 \ 0^*^), we first need to determine the 
conditional distribution under which to take expectations. 

Define Qq = cr“^I, and Q/('i?) = where 

= (cov(Ct(s),Ct'(u) I 0^*^) : s,u G t,t' G T), 

= {Cff{s,u\ d) :s,ue Dj). 

Eor notational convenience, from now on we shall omit the dependence of Sj and Qj on i?, and 
the dependence of and on 

The observation model for the discretised system is, 


Zm — CY^ -|- 


where C = (C) ; t G T)^ is the incidence matrix mapping the discretised mole-fraction field to 


35 



the observations. Define : t G T)' and fij = {nf{s \ •&) : s ^ ■ Then 


ln[Yj, I Z^, 0«, = ci-0.5(Z™ - CY™)'Qo(Z^ - CY^) 

-0.5(Y^-BbYj)'Q^(Y^-BsY;) (A.4) 

- 0.5(ln Yy - ^/)'Q^(lnY/ - Jlf) - (In Y/)'l. 


Due to the presence of In Yj in (A.4), this conditional distribution is not Gaussian. We sum¬ 
marise it using the first two moments, approximated using first-order and second-order deriva¬ 
tives. 


Define Dj = diag(l0Yj) and ^ff= diag(l0 (YjOYj)). Then the derivatives J of (A.4) 
with respect to Y = (Yj, Y(,^)', are given by 


/ain[Y/,Y^ I Z^,6»«,i9]\ 


J(Y I 0®,-^) = 



B'sQ^Y™ - B'sQ^BYy - D/Q^(lnY/ - Jlf) - (1 0 Y/) 
^ -C'QoCY™ + C'QoZm - Q^{Ym - BbYj) 


The Hessian H is 


/ain[Y/,Y^ I Z^,0«,^] 91n[Y/,Y^ | Z™,0«,i?]\ 


H(Y I 0^,1?) = 


5Y/Y' 


dYmY'f 


d\n[Yf,Ym I Z^,0«,^] 91n[Y/,Y^ | Z™,0«,i?] 


5Y/Y; 


8Y V' 

^ ^ m ^ m 




( 
V 


-B'^Q^Bb - D/QjD/ -h diag(Qj(ln Y/ - 

Q^Bb 


B'sQc 

-C'QoC - 


We now use a gradient-descent algorithm with the derivatives J(Y | to find the mode 

Y* of the distribution and use this as an approximation to the mean. We evaluate H at this 
mode; the approximate covariance is then given by —H(Y* | For gradient descent, 

we used the R function optim with the method BFGS. 
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From (A.3), the function Q{6 \ 0^*)) 


is 


Q(e I 0®) = -O.51n|5]<;(0)| - O.5tr(Q^(0)’If), 


where 


'J'=E(Y™Y:„ I 

+ BbE(Y/Y^ I (A.5) 

-2BsE(Y;Y;, I Z™,0«,i9). 


Note that only the first two moments of 




are required to compute (A.5). 


Appendix A.2. M-step 
In the M-step we set 


gii+A = argmaxQ(0 | 0^*^). (A.6) 

e 

The discrepancy term Ct(s) has a separable spatio-temporal covariance function. Hence, we can 
write the full space-time covariance matrix as 


S(;(0) = cr^R<^(a, d) 


(A.7) 


where recall that (8) is the Kronecker product and Rf,s; R^.t a.re correlation matrices. In the 
M-step we maximise Q{6 \ 0^)) in (A.6) using gradient descent; the following gradients are 


obtained by repeated application of (A.l) and (A.2): 
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dQ{e I 6»») _ 


da'^ 


= 0.5^—nfjj, ^ ^tr(R^(a, d) 


dQ{e I 6> W) _ 

da 


= —0.5tr| (R^ t(a) ® R^ Ad)) 

d^UA 




+ 0.5(t^ ^tiYRj, ^(a, d) f- 


da 


K^,s{d)]Ti-\a,d)^ 


dQ{e I 6»W) 


= -0.5tr^(R^,t(a) (g) Rc,s('^)) ^ ^R-C,t(a) ® 

+ O.Sa^-^tr (^R^i(a, d) (^Rc,t(a) ® R-C, 


where here n = |Z)^||T|. In the equations above, the mixed-product property of the Kronecker 
product can be used to simplify the computations; for example, 




/ gRa(«) 

V da 


(g) R(^^5(d) 



gRc.t(«) ^ 

da ) 


(g)I. 


All that remains is to specify the partial derivatives of the spatial and temporal correlation 
matrices with respect to a and d. These are obtained from the derivatives of the correlation 
functions: 


= \t - |a| < 1, 

da 

d , , llu —s|| f llu —s||\ 

Appendix B. Directly inferring the total flux by grid cell 

Instead of computing the flux (a density that is per unit area), running the EM-Laplace/HMC 
algorithm, and rescaling again to obtain the total flux, we can instead infer the total flux by 
grid cell, {A(u)y/-(u) : u G directly with only a few modifications. Write down the 
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approximation to the integral in Q as 


/ 6t(s,u)y/(u)du~ V 6i(s,u)(^(u)y/(u)) 

= Y1 ^i(s>u)y“(u), 

u&Dj 

where Yj°^{u) = yl(u)lj(u). Now we seek a model for Yj°^{u) ; however, this poses no additional 
work since 

y/°*(s) = lny/°*(s) = lny4(s) + lny/(s). (B.l) 

It thus follows that Yj°^ is also a lognormal spatial process, shifted by a known quantity. From 


E(y“(s) I 1?) = In A(s) + fif{s I 1?), 
cov(y/°*(s),y/°‘(u) 1 1?) = Cff{s,u\'d). 

Note that the covariance function remains unchanged, and in order to solve directly for Yj°^{-) 
one uses = lnj4(s) + | i?), instead of /i/(s | in Q and Q. Then is 

constructed without the area-weight terms since they have already been accounted for. That is, 
in this case the matrix is given by. 


= (btis, u) : s e Di^; u € Df). 
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